{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "# NLP Architect - Intent Extraction tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "Let's start by importing all the important classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from nlp_architect.models.intent_extraction import MultiTaskIntentModel\n",
    "from nlp_architect.data.intent_datasets import SNIPS\n",
    "from nlp_architect.utils.embedding import load_word_embeddings\n",
    "from nlp_architect.utils.metrics import get_conll_scores\n",
    "from nlp_architect.utils.generic import one_hot\n",
    "\n",
    "from tensorflow.python.keras.utils import to_categorical"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Preparing the data\n",
    "The first step is the download the dataset into a folder and load the data into the memory\n",
    "using the `SNIPS` data loader.\n",
    "\n",
    "### SNIPS NLU Benchmark dataset\n",
    "\n",
    "SNIPS dataset has 7 types of intents:\n",
    "- ‘Add to playlist’\n",
    "- ‘Rate book’\n",
    "- ‘Check weather’\n",
    "- ‘Play music’\n",
    "- ‘Book restaurant’\n",
    "- ‘Search event’\n",
    "- ‘Search art’\n",
    "\n",
    "73 types of labels (including `B-` and `I-` prefixed labels), train/test set sizes: ~14000/700\n",
    "\n",
    "More info: [here](https://github.com/snipsco/nlu-benchmark)\n",
    "\n",
    "(The terms and conditions of the data set license apply. Intel does not grant any rights to the data files)\n",
    "\n",
    "Git clone the repository with the dataset:\n",
    "```\n",
    "git clone https://github.com/snipsco/nlu-benchmark.git\n",
    "```\n",
    "\n",
    "Point the source of the dataset to `nlu-benchmark/2017-06-custom-intent-engines/` "
   ]
  },
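  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the `B-` and `I-` label prefixes concrete, here is a minimal sketch of how slot spans map to per-token tags. The helper `spans_to_bio`, the span format, the example sentence and the slot names are illustrative assumptions; they are not part of NLP Architect or the SNIPS files.\n",
    "\n",
    "```python\n",
    "def spans_to_bio(tokens, spans):\n",
    "    # spans: list of (start, end, slot_name), end exclusive (hypothetical format)\n",
    "    tags = ['O'] * len(tokens)  # 'O' marks tokens outside any slot\n",
    "    for start, end, slot in spans:\n",
    "        tags[start] = 'B-' + slot  # first token of the slot\n",
    "        for i in range(start + 1, end):\n",
    "            tags[i] = 'I-' + slot  # remaining tokens of the same slot\n",
    "    return tags\n",
    "\n",
    "tokens = ['play', 'hey', 'jude', 'by', 'the', 'beatles']\n",
    "spans = [(1, 3, 'track'), (4, 6, 'artist')]\n",
    "bio_tags = spans_to_bio(tokens, spans)\n",
    "for token, tag in zip(tokens, bio_tags):\n",
    "    print(token, tag)\n",
    "```\n",
    "\n",
    "Every slot type therefore contributes two labels, one `B-` and one `I-`, in addition to the shared `O` label."
   ]
  },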
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence_length = 50\n",
    "word_length = 12"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dataset_path = 'nlu-benchmark/2017-06-custom-intent-engines/'\n",
    "dataset = SNIPS(path=dataset_path,\n",
    "                sentence_length=sentence_length,\n",
    "                word_length=word_length)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the dataset is loaded, we can extract the ready made `train` and `test` sets. Each set is made up of a tuple of 4 elements:\n",
    "- Words (`train_x` and `test_x`)\n",
    "- Word character representation (`train_c` and `test_c`)\n",
    "- Intent type (`train_i` and `test_i`)\n",
    "- Token slot tags (`train_y` and `test_y`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "train_x, train_c, train_i, train_y = dataset.train_set\n",
    "test_x, test_c, test_i, test_y = dataset.test_set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "train_x.shape, train_c.shape, train_i.shape, train_y.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sentences are encoded in sparse int representation (str->int vocabularies stored in the dataset object) as NumPy arrary.\n",
    "Lets look at the sentence in index 5544, translate it back to strings so we could read the sentence, and look at the encoded label tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "train_x[5544]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[dataset.word_vocab.id_to_word(i) for i in train_x[5544] if dataset.word_vocab.id_to_word(i) is not None]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[dataset.tags_vocab.id_to_word(i) for i in train_y[5544] if dataset.tags_vocab.id_to_word(i) is not None]"
   ]
  },
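  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The str->int encoding used above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function `encode_sentence`, the toy vocabulary, the use of ID 0 for padding and ID 1 for unknown words, and padding on the right are all hypothetical and may differ from what the `SNIPS` loader actually does.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def encode_sentence(tokens, word_to_id, max_len):\n",
    "    # Assumed convention: 0 = padding, 1 = unknown word\n",
    "    ids = [word_to_id.get(tok, 1) for tok in tokens][:max_len]\n",
    "    encoded = np.zeros(max_len, dtype=np.int32)\n",
    "    encoded[:len(ids)] = ids  # pad on the right with zeros\n",
    "    return encoded\n",
    "\n",
    "word_to_id = {'play': 2, 'some': 3, 'jazz': 4}  # toy vocabulary\n",
    "id_to_word = {i: w for w, i in word_to_id.items()}\n",
    "\n",
    "encoded = encode_sentence(['play', 'some', 'jazz'], word_to_id, max_len=6)\n",
    "# Decoding skips IDs that have no word, such as the padding ID\n",
    "decoded = [id_to_word[i] for i in encoded.tolist() if i in id_to_word]\n",
    "print(encoded)\n",
    "print(decoded)\n",
    "```\n",
    "\n",
    "Fixed-length arrays like this are what allow all sentences to be stacked into a single 2D array such as `train_x`."
   ]
  },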
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### External word embedding\n",
    "\n",
    "Now it's time to load the external word embedding model.\n",
    "We'll use `load_word_embeddings` function that reads the file and loads up the words into numpy arrays.\n",
    "Once done, we'll create a 2D array with the words we have in our dataset word lexicon - we'll save it in `embedding_matrix` and we'll use it later when we load the embedding layer of the words.\n",
    "\n",
    "You can download the GloVe word embedding models from [here](https://nlp.stanford.edu/projects/glove/).\n",
    "\n",
    "(The terms and conditions of the data set license apply. Intel does not grant any rights to the data files)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from nlp_architect.utils.embedding import get_embedding_matrix\n",
    "\n",
    "embedding_path = 'glove.6B.100d.txt'\n",
    "embedding_size = 100\n",
    "\n",
    "embedding_model, _ = load_word_embeddings(embedding_path)\n",
    "embedding_mat = get_embedding_matrix(embedding_model, dataset.word_vocab)\n",
    "\n",
    "embedding_mat.shape"
   ]
  },
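  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough idea of what such an embedding matrix contains, here is a minimal sketch that builds one from a dictionary of word vectors. The function `build_embedding_matrix`, the toy vectors and the choice of zero rows for missing words are assumptions made for illustration; `get_embedding_matrix` may handle out-of-vocabulary words differently.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def build_embedding_matrix(vectors, word_to_id, dim):\n",
    "    # Row i holds the vector of the word with ID i. Words missing from\n",
    "    # `vectors`, and the padding ID 0, keep an all-zero row (assumption).\n",
    "    matrix = np.zeros((len(word_to_id) + 1, dim), dtype=np.float32)\n",
    "    for word, idx in word_to_id.items():\n",
    "        if word in vectors:\n",
    "            matrix[idx] = vectors[word]\n",
    "    return matrix\n",
    "\n",
    "vectors = {'play': np.array([0.1, 0.2, 0.3]),\n",
    "           'jazz': np.array([0.4, 0.5, 0.6])}\n",
    "word_to_id = {'play': 1, 'some': 2, 'jazz': 3}\n",
    "toy_matrix = build_embedding_matrix(vectors, word_to_id, dim=3)\n",
    "print(toy_matrix.shape)\n",
    "```\n",
    "\n",
    "Because the row index equals the word ID, an embedding layer can look up a word's vector directly from the integer-encoded sentences."
   ]
  },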
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Building the network\n",
    "\n",
    "Now for the fun part, let's start by defining the parameters of the network we're going to build, such as, the LSTM layer's hidden state, the number of output labels and intents to predict and the size of the character embedding vectors.\n",
    "\n",
    "The network topology looks as the following diagram\n",
    "\n",
    "### High level topology\n",
    "\n",
    "![image.png](attachment:image.png)\n",
    "\n",
    "This network is defined in `nlp_architect.models.intent_extraction` packages as `MultiTaskIntentModel`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first convert the slot labels an intent classifications into 1-hot encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_y = to_categorical(test_y, dataset.label_vocab_size)\n",
    "train_y = to_categorical(train_y, dataset.label_vocab_size)\n",
    "train_i = one_hot(train_i, len(dataset.intents_vocab))\n",
    "test_i = one_hot(test_i, len(dataset.intents_vocab))"
   ]
  },
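  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For readers unfamiliar with one-hot encoding, the following NumPy sketch shows the transformation on toy data. The helper `one_hot_encode` is hypothetical and is not the implementation of `to_categorical` or `one_hot`; it only illustrates the resulting shapes.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def one_hot_encode(ids, num_classes):\n",
    "    # Indexing an identity matrix with integer IDs yields one-hot rows\n",
    "    ids = np.asarray(ids)\n",
    "    return np.eye(num_classes, dtype=np.float32)[ids]\n",
    "\n",
    "# Slot tags: one ID per token, shape (sentences, tokens)\n",
    "slot_ids = np.array([[0, 2, 1], [1, 0, 0]])\n",
    "slot_one_hot = one_hot_encode(slot_ids, num_classes=3)\n",
    "\n",
    "# Intents: one ID per sentence, shape (sentences,)\n",
    "intent_one_hot = one_hot_encode([2, 0], num_classes=3)\n",
    "\n",
    "print(slot_one_hot.shape, intent_one_hot.shape)\n",
    "```\n",
    "\n",
    "The slot tags gain a trailing class dimension per token, while the intents gain one per sentence, which matches the two outputs of the multi-task network."
   ]
  },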
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define the input and output data sources"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_inputs = [train_x, train_c]\n",
    "train_outs = [train_i, train_y]\n",
    "test_inputs = [test_x, test_c]\n",
    "test_outs = [test_i, test_y]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We initiate the model object and build the network with the defined parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MultiTaskIntentModel()\n",
    "model.build(dataset.word_len,\n",
    "            dataset.label_vocab_size,\n",
    "            dataset.intent_size,\n",
    "            dataset.word_vocab_size,\n",
    "            dataset.char_vocab_size,\n",
    "            word_emb_dims=embedding_size,\n",
    "            tagger_lstm_dims=100,\n",
    "            dropout=0.2)\n",
    "model.load_embedding_weights(embedding_mat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the network\n",
    "We've got a model, it's time to train the network.\n",
    "\n",
    "We define the batch size and the number of epochs to run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "batch_size = 32\n",
    "no_epochs = 5\n",
    "\n",
    "# train the model\n",
    "model.fit(train_inputs, train_outs,\n",
    "          batch_size=batch_size,\n",
    "          epochs=no_epochs,\n",
    "          validation=(test_inputs, test_outs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Testing and evaluating the network\n",
    "Great! we have a trained model, let's check how well it performs.\n",
    "\n",
    "First, we need to run all the test data through the network and get the network's preditions. Once done, we can use `get_conll_scores` to get the actual CONLLEVAL benchmark results on the test data (in terms of precision/recall/F1 and per label type)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "predictions = model.predict([test_x, test_c], batch_size=batch_size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "predictions[0].shape, predictions[1].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "eval = get_conll_scores(predictions, test_y,\n",
    "                            {v: k for k, v in dataset.tags_vocab.vocab.items()})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overall performance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(eval)"
   ]
  },
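  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CoNLL-style scores are computed over whole slot spans rather than individual tokens. The sketch below illustrates the idea on a single toy sentence. It is a simplified assumption of how chunk-level scoring works, not the implementation of `get_conll_scores`: the helpers `extract_spans` and `span_f1` are hypothetical, and stray `I-` tags without a preceding `B-` are simply ignored here.\n",
    "\n",
    "```python\n",
    "def extract_spans(tags):\n",
    "    # Collect (start, end, slot) spans, end exclusive, from BIO tags\n",
    "    spans, start, slot = set(), None, None\n",
    "    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes the last span\n",
    "        inside = tag.startswith('I-') and tag[2:] == slot\n",
    "        if start is not None and not inside:\n",
    "            spans.add((start, i, slot))\n",
    "            start, slot = None, None\n",
    "        if tag.startswith('B-'):\n",
    "            start, slot = i, tag[2:]\n",
    "    return spans\n",
    "\n",
    "def span_f1(true_tags, pred_tags):\n",
    "    true_spans = extract_spans(true_tags)\n",
    "    pred_spans = extract_spans(pred_tags)\n",
    "    correct = len(true_spans & pred_spans)  # exact boundary and type match\n",
    "    precision = correct / len(pred_spans) if pred_spans else 0.0\n",
    "    recall = correct / len(true_spans) if true_spans else 0.0\n",
    "    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0\n",
    "    return precision, recall, f1\n",
    "\n",
    "true_tags = ['O', 'B-track', 'I-track', 'O', 'B-artist', 'I-artist']\n",
    "pred_tags = ['O', 'B-track', 'I-track', 'O', 'B-artist', 'O']\n",
    "precision, recall, f1 = span_f1(true_tags, pred_tags)\n",
    "print(precision, recall, f1)\n",
    "```\n",
    "\n",
    "In this toy case the predicted `artist` span is one token too short, so it counts as wrong even though its first token was tagged correctly. This is why span-level F1 is stricter than per-token accuracy."
   ]
  },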
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Per label performance breakdown"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Intent classification accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "predicted_intents = predictions[0].argmax(1)\n",
    "truth_intents = test_i.argmax(1)\n",
    "accuracy_score(truth_intents, predicted_intents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using GloVe 300 word embedding model and 50+ epochs of training should produce a model with:\n",
    "\n",
    "- Intent detection: >99 F1\n",
    "- Slot label classification: >95 F1\n",
    "---"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
